Exploring Simpson’s Paradox within the Penguin Dataset

And simultaneously demonstrating the capabilities of Quarto.

This document is a short analysis of the penguin dataset. It explores the relationship between bill length and bill depth and shows how important it is to consider group effects.
Author: Alberto F Cabrera
Affiliation: University of Maryland
Published: August 26, 2024

A few considerations about this doc

This Quarto document serves as a practical illustration of the concepts covered in the Productive R Workflow online course.

1 Introduction

This document offers a straightforward analysis of the well-known penguin dataset. It is designed to complement the Productive R Workflow online course.

You can read more about the penguin dataset here.

Let’s load libraries before we start!

library(tidyverse)
library(hrbrthemes)    # ipsum theme for ggplot2 charts
library(patchwork)     # combine charts together

1.1 Loading data

The dataset has already been loaded and cleaned in the previous step of this pipeline.

Let’s load the clean version, together with a few functions available in functions.R.

# Source functions
source(file = "functions.R")

# Read the clean dataset
data <- readRDS(file = "../data/data_clean.rds")
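For readers unfamiliar with the RDS format, the following is a minimal sketch of a save-and-reload round trip. The cleaning step itself is not shown in this document, so the data frame `toy`, its values, and the temporary file path are all made up for illustration.

```r
# Hypothetical stand-in for the cleaned data; these values are made up
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 46.1),
  stringsAsFactors = FALSE
)

# Write the object to a temporary .rds file, then read it back
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy, file = rds_path)
restored <- readRDS(file = rds_path)

# Check that column types and values survived the round trip
identical(toy, restored)
```

Unlike a CSV file, an RDS file stores a single R object together with its column types, which is presumably why the pipeline uses it to pass data between steps.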

Appraising penguins’ bill length and depth

1.2 Bill Length and Bill Depth

Now, let’s carry out some descriptive analysis, including summary statistics and graphs.

What’s striking is the slightly negative relationship between bill length and bill depth:

\[\mathrm{Avg} = \frac{1}{n}\sum_{i=1}^{n} a_i = \frac{a_1 + a_2 + \cdots + a_n}{n}\]

library(hrbrthemes)    # provides theme_ipsum()

palmerpenguins::penguins |>
  filter(!is.na(sex)) |>    # drop penguins whose sex was not recorded
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "#69b3a2") +
  labs(
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    title = "Surprising relationship?"
  ) +
  theme_ipsum()

Relationship between bill length and bill depth. All data points included.
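The negative trend in the scatterplot can also be quantified. The sketch below, which assumes the `dplyr` and `palmerpenguins` packages are installed, computes the Pearson correlation once for all penguins and once within each species. The names `complete`, `overall_cor` and `cor_by_species` are introduced here for illustration only.

```r
library(dplyr)
library(palmerpenguins)

# Same filtering as in the scatterplot: drop penguins with unknown sex
complete <- penguins |>
  filter(!is.na(sex))

# Correlation across all penguins, ignoring species
overall_cor <- cor(complete$bill_length_mm, complete$bill_depth_mm)

# Correlation computed separately within each species
cor_by_species <- complete |>
  group_by(species) |>
  summarise(correlation = cor(bill_length_mm, bill_depth_mm))

overall_cor
cor_by_species
```

On this dataset the overall coefficient should come out negative while each within-species coefficient is positive. That sign reversal between the pooled data and the groups is what Simpson's paradox refers to.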

It is also interesting to note that bill length and bill depth differ considerably from one species to another. This is summarized in the two tables below:

# Average bill length per species
data %>%
  group_by(species) %>%
  summarise(average_bill_length = mean(bill_length_mm, na.rm = TRUE))

# Average bill depth per species
data %>%
  group_by(species) %>%
  summarise(average_bill_depth = mean(bill_depth_mm, na.rm = TRUE))
# A tibble: 3 × 2
  species   average_bill_length
  <chr>                   <dbl>
1 Adelie                   38.8
2 Chinstrap                48.8
3 Gentoo                   47.5
# A tibble: 3 × 2
  species   average_bill_depth
  <chr>                  <dbl>
1 Adelie                  18.3
2 Chinstrap               18.4
3 Gentoo                  15.0
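The two summaries can also be produced in a single call, which keeps both averages side by side. This is a sketch that uses `palmerpenguins::penguins` as a stand-in for the cleaned `data` object, on the assumption that both share the same column names. The name `bill_summary` is introduced here for illustration.

```r
library(dplyr)
library(palmerpenguins)

# One grouped summary holding both averages per species
bill_summary <- penguins |>
  group_by(species) |>
  summarise(
    average_bill_length = mean(bill_length_mm, na.rm = TRUE),
    average_bill_depth  = mean(bill_depth_mm, na.rm = TRUE)
  )

bill_summary
```

Each average is the arithmetic mean from the formula earlier in this section, with `na.rm = TRUE` dropping missing measurements before the sum is divided by the count.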
Note: the three plots below are not displayed correctly, unlike in the author’s rendition.

Now, let’s check the relationship between bill depth and bill length within a single species and island, starting with the species Adelie on the island Torgersen:

# Use the function in functions.R
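# Assuming the third argument is the island, as the text suggests:
# Chinstrap penguins were only recorded on Dream and Gentoo only on Biscoe,
# so any other pairing gives an empty panel
p1 <- create_scatterplot(data, "Adelie", "Torgersen")
p2 <- create_scatterplot(data, "Chinstrap", "Dream")
p3 <- create_scatterplot(data, "Gentoo", "Biscoe")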

(p1 + p2) / p3
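The definition of `create_scatterplot()` lives in functions.R and is not shown in this document. As a rough idea of what such a helper could look like, here is a hypothetical sketch that filters by species and island before plotting. The function name `create_scatterplot_sketch`, its argument names, and the warning on empty data are all assumptions, not the course's actual code.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for create_scatterplot() from functions.R,
# whose real definition is not shown in this document
create_scatterplot_sketch <- function(data, selected_species, selected_island) {
  # Keep only the requested species/island combination
  filtered <- data |>
    filter(species == selected_species, island == selected_island)

  # An empty result means the species was never recorded on that island
  if (nrow(filtered) == 0) {
    warning("No rows for ", selected_species, " on ", selected_island)
  }

  ggplot(filtered, aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point(color = "#69b3a2") +
    labs(
      x = "Bill Length (mm)",
      y = "Bill Depth (mm)",
      title = paste(selected_species, "on", selected_island)
    )
}

# Example call on the raw palmerpenguins data
p_adelie <- create_scatterplot_sketch(palmerpenguins::penguins, "Adelie", "Torgersen")
```

A check like the warning above can help explain an unexpectedly empty panel, since not every species was observed on every island.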

1.2.1 Displaying penguins data as a DT table

library(palmerpenguins)
library(tidyverse)
library(DT)

# Relative path, matching the earlier call to readRDS(), so the document renders on other machines
data_clean <- readRDS("../data/data_clean.rds")

datatable(data_clean, filter = "top")

Making the scatterplot interactive

library(tidyverse)
library(plotly)
library(hrbrthemes)

# Build the static ggplot first; a distinct name avoids shadowing the penguins dataset
bill_scatter <- palmerpenguins::penguins |>
  filter(!is.na(sex)) |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point(color = "#69b3a2") +
  labs(
    x = "Bill Length (mm)",
    y = "Bill Depth (mm)",
    title = "Surprising relationship?"
  ) +
  theme_ipsum()

# Convert the ggplot object into an interactive plotly widget
ggplotly(bill_scatter)

Using a kable table

library(tidyverse)
library(knitr)
library(palmerpenguins)


bill_length_per_specie <- palmerpenguins::penguins |>
  group_by(species) |>
  summarise(average_bill_length = mean(bill_length_mm, na.rm = TRUE))

kable(bill_length_per_specie)
species     average_bill_length
Adelie                 38.79139
Chinstrap              48.83382
Gentoo                 47.50488

bill_depth_per_specie <- palmerpenguins::penguins %>%
  group_by(species) %>%
  summarise(average_bill_depth = mean(bill_depth_mm, na.rm = TRUE))

kable(bill_depth_per_specie)
species     average_bill_depth
Adelie                18.34636
Chinstrap             18.42059
Gentoo                14.98211

bill_length_adelie <- bill_length_per_specie %>%
  filter(species == "Adelie") %>%
  pull(average_bill_length) %>%
  round(2)

For instance, the average bill length for the species Adelie is 38.79.